    Error, reproducibility and sensitivity : a pipeline for data processing of Agilent oligonucleotide expression arrays

    Background: Expression microarrays are increasingly used to obtain large-scale transcriptomic information on a wide range of biological samples. Nevertheless, there is still much debate on the best ways to process data, design experiments and analyse the output. Furthermore, many of the more sophisticated mathematical approaches to data analysis in the literature remain inaccessible to much of the biological research community. In this study we examine ways of extracting and analysing a large data set obtained using the Agilent long oligonucleotide transcriptomics platform, applied to a set of human macrophage and dendritic cell samples. Results: We describe and validate a series of data extraction, transformation and normalisation steps which are implemented via a new R function. Analysis of replicate normalised reference data demonstrates that intra-array variability is small (only around 2% of the mean log signal), while inter-array variability from replicate array measurements has a standard deviation (SD) of around 0.5 log2 units (around 6% of the mean). The common practice of working with ratios of Cy5/Cy3 signal offers little further improvement in terms of reducing error. Comparison to expression data obtained using Arabidopsis samples demonstrates that the large number of genes in each sample showing a low level of transcription reflects the real complexity of the cellular transcriptome. Multidimensional scaling is used to show that the processed data identify an underlying structure which reflects some of the key biological variables defining the data set. This structure is robust, allowing reliable comparison of samples collected over a number of years by a variety of operators. Conclusions: This study outlines a robust and easily implemented pipeline for extracting, transforming, normalising and visualising transcriptomic array data from the Agilent expression platform. The analysis is used to obtain quantitative estimates of the SD arising from experimental (non-biological) intra- and inter-array variability, and a lower threshold for determining whether an individual gene is expressed. The study provides a reliable basis for further, more extensive studies of the systems biology of eukaryotic cells.
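
    A minimal sketch of the kind of pipeline this abstract describes: log transformation, a simple between-array normalisation, and multidimensional scaling of the samples. The function names and the median-centring step are illustrative assumptions, not the authors' actual R implementation.

```python
# Sketch of an expression-array processing pipeline (assumed steps, not the
# paper's R function): log2-transform, median-centre arrays, project by MDS.
import numpy as np
from sklearn.manifold import MDS

def log_transform(raw):
    """Log2-transform raw intensities, clipping at 1 to avoid log(0)."""
    return np.log2(np.clip(raw, 1, None))

def median_centre(log_signal):
    """Normalise between arrays by subtracting each array's median signal."""
    return log_signal - np.median(log_signal, axis=1, keepdims=True)

def mds_coords(log_signal, n_components=2):
    """Place samples in 2D by MDS on pairwise Euclidean distances."""
    diffs = log_signal[:, None, :] - log_signal[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=2))
    mds = MDS(n_components=n_components, dissimilarity="precomputed",
              random_state=0)
    return mds.fit_transform(dist)

raw = np.random.lognormal(mean=6, sigma=2, size=(12, 2000))  # fake arrays
coords = mds_coords(median_centre(log_transform(raw)))
print(coords.shape)  # (12, 2): one point per sample for plotting
```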

    Optimality Driven Nearest Centroid Classification from Genomic Data

    Nearest-centroid classifiers have recently been successfully employed in high-dimensional applications, such as in genomics. A necessary step when building a classifier for high-dimensional data is feature selection. Feature selection is frequently carried out by computing univariate scores for each feature individually, without consideration for how a subset of features performs as a whole. We introduce a new feature selection approach for high-dimensional nearest-centroid classifiers that is instead based on the theoretically optimal choice of a given number of features, which we determine directly here. This allows us to develop a new greedy algorithm to estimate this optimal nearest-centroid classifier with a given number of features. In addition, whereas the centroids are usually formed from maximum likelihood estimates, we investigate the applicability of high-dimensional shrinkage estimates of centroids. We apply the proposed method to clinical classification based on gene-expression microarrays, demonstrating that the proposed method can outperform existing nearest-centroid classifiers.
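
    For orientation, here is a basic nearest-centroid classifier with univariate feature scoring and soft-thresholded (shrunken) centroids. This is an illustrative baseline of the techniques the abstract contrasts itself with, not the paper's optimality-driven greedy algorithm; all names are assumptions.

```python
# Baseline nearest-centroid classification with univariate feature selection
# and optional centroid shrinkage (illustrative, not the paper's method).
import numpy as np

def select_features(X, y, k):
    """Rank features by a two-class t-like score and keep the top k."""
    a, b = X[y == 0], X[y == 1]
    score = np.abs(a.mean(0) - b.mean(0)) / (a.std(0) + b.std(0) + 1e-8)
    return np.argsort(score)[::-1][:k]

def fit_centroids(X, y, shrink=0.0):
    """Class centroids, optionally shrunk towards the overall centroid."""
    overall = X.mean(0)
    cents = []
    for c in np.unique(y):
        d = X[y == c].mean(0) - overall
        d = np.sign(d) * np.maximum(np.abs(d) - shrink, 0)  # soft threshold
        cents.append(overall + d)
    return np.vstack(cents)

def predict(X, centroids):
    """Assign each sample to the class of the nearest centroid."""
    d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(2)
    return d.argmin(1)

rng = np.random.default_rng(0)
X, y = rng.normal(size=(30, 500)), rng.integers(0, 2, 30)  # toy data
keep = select_features(X, y, 20)
cents = fit_centroids(X[:, keep], y, shrink=0.1)
print(predict(X[:, keep], cents)[:5])
```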

    A comprehensive re-analysis of the Golden Spike data: Towards a benchmark for differential expression methods

    Background: The Golden Spike data set has been used to validate a number of methods for summarizing Affymetrix data sets, sometimes with seemingly contradictory results. Much less use has been made of this data set to evaluate differential expression methods. It has been suggested that this data set should not be used for method comparison due to a number of inherent flaws. Results: We have used this data set in a comparison of methods which is far more extensive than any previous study. We outline six stages in the analysis pipeline where decisions need to be made, and show how the results of these decisions can lead to the apparently contradictory results previously found. We also show that, while flawed, this data set is still a useful tool for method comparison, particularly for identifying combinations of summarization and differential expression methods that are unlikely to perform well on real data sets. We describe a new benchmark, AffyDEComp, that can be used for such a comparison. Conclusion: We conclude with recommendations for preferred Affymetrix analysis tools, and for the development of future spike-in data sets.
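
    The core of a spike-in benchmark of this kind is scoring how well a differential expression statistic ranks truly spiked-in genes above unchanged ones. A hedged sketch of that evaluation step, with simulated placeholders standing in for the Golden Spike truth and for any particular method's statistic:

```python
# Score a DE method's gene ranking against known spike-in truth via ROC AUC
# (the general idea behind a benchmark such as AffyDEComp; data are toy).
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
truth = rng.random(1000) < 0.1                       # truly spiked-in genes
stat = np.abs(rng.normal(size=1000)) + 2.0 * truth   # toy DE statistic
print("AUC:", roc_auc_score(truth, stat))            # higher = better ranking
```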

    An Introspective Comparison of Random Forest-Based Classifiers for the Analysis of Cluster-Correlated Data by Way of RF++

    Many mass spectrometry-based studies, as well as other biological experiments, produce cluster-correlated data. Failure to account for correlation among observations may result in a classification algorithm overfitting the training data and producing overoptimistic estimated error rates, and may make subsequent classifications unreliable. Current common practice for dealing with replicated data is to average each subject's replicate sample set, reducing the dataset size and incurring loss of information. In this manuscript we compare three approaches to dealing with cluster-correlated data: unmodified Breiman's Random Forest (URF), forest grown using subject-level averages (SLA), and RF++ with subject-level bootstrapping (SLB). RF++, a novel Random Forest-based algorithm implemented in C++, handles cluster-correlated data through a modification of the original resampling algorithm and accommodates subject-level classification. Subject-level bootstrapping is an alternative sampling method that obviates the need to average or otherwise reduce each set of replicates to a single independent sample. Our experiments show nearly identical median classification and variable selection accuracy for SLB forests and URF forests when applied to both simulated and real datasets. However, the run-time estimated error rate was severely underestimated for URF forests. Predictably, SLA forests were found to be more severely affected by the reduction in sample size, which led to poorer classification and variable selection accuracy. Perhaps most importantly, our results suggest that it is reasonable to utilize URF for the analysis of cluster-correlated data. Two caveats should be noted: first, correct classification error rates must be obtained using a separate test dataset, and second, an additional post-processing step is required to obtain subject-level classifications. RF++ is shown to be an effective alternative for classifying both clustered and non-clustered data. Source code and stand-alone compiled versions of command-line and easy-to-use graphical user interface (GUI) versions of RF++ for Windows and Linux, as well as a user manual (Supplementary File S2), are available for download at http://sourceforge.org/projects/rfpp/ under the GNU public license.
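
    A minimal sketch of the subject-level bootstrapping idea: resample whole subjects (with all of their replicates) rather than individual replicate observations, so replicates of one subject never straddle the in-bag/out-of-bag split. This illustrates the resampling concept behind RF++, not its C++ implementation; variable names are assumptions.

```python
# Subject-level bootstrap for cluster-correlated data: draw subjects with
# replacement, then include every replicate row of each drawn subject.
import numpy as np

def subject_level_bootstrap(subject_ids, rng):
    """Return row indices for one bootstrap sample drawn at the subject level."""
    subjects = np.unique(subject_ids)
    drawn = rng.choice(subjects, size=len(subjects), replace=True)
    rows = [np.flatnonzero(subject_ids == s) for s in drawn]
    return np.concatenate(rows)

rng = np.random.default_rng(42)
ids = np.array([0, 0, 0, 1, 1, 2, 2, 2, 3])  # 4 subjects with replicates
print(subject_level_bootstrap(ids, rng))     # rows grouped by drawn subject
```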

    Blood Signature of Pre-Heart Failure: A Microarrays Study

    BACKGROUND: The preclinical stage of systolic heart failure (HF), known as asymptomatic left ventricular dysfunction (ALVD), is diagnosed only by echocardiography, is frequent in the general population, and leads to a high risk of developing severe HF. Large-scale screening for ALVD is a difficult task and represents a major unmet clinical challenge that requires the determination of ALVD biomarkers. METHODOLOGY/PRINCIPAL FINDINGS: 294 individuals were screened by echocardiography. We identified 9 ALVD cases out of 128 subjects with cardiovascular risk factors. White blood cell gene expression profiling was performed using pangenomic microarrays. Data were analyzed using principal component analysis (PCA) and Significance Analysis of Microarrays (SAM). To build an ALVD classifier model, we used the nearest centroid classification method (NCCM) with the ClaNC software package. Classification performance was determined using the leave-one-out cross-validation method. Blood transcriptome analysis provided a specific molecular signature for ALVD which defined a model based on 7 genes capable of discriminating ALVD cases. Analysis of an ALVD patient validation group demonstrated that these genes are accurate diagnostic predictors for ALVD, with 87% accuracy and 100% precision. Furthermore, Receiver Operating Characteristic curves of expression levels confirmed that 6 out of 7 genes discriminate for left ventricular dysfunction classification. CONCLUSIONS/SIGNIFICANCE: These targets could serve to enhance the ability of general care practitioners to efficiently detect ALVD, facilitating preemptive initiation of medical treatment preventing the development of HF.
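
    The validation scheme described here (nearest centroid classification scored by leave-one-out cross-validation) can be sketched with scikit-learn stand-ins in place of the ClaNC R package named in the abstract; the data below are simulated placeholders, not the study's 7-gene signature.

```python
# Leave-one-out cross-validation of a nearest-centroid classifier
# (scikit-learn stand-in for ClaNC; toy data, not the ALVD cohort).
import numpy as np
from sklearn.neighbors import NearestCentroid
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 7))        # 40 subjects x 7 signature genes (toy)
y = rng.integers(0, 2, size=40)     # ALVD vs control labels (toy)
acc = cross_val_score(NearestCentroid(), X, y, cv=LeaveOneOut())
print("LOOCV accuracy:", acc.mean())
```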

    Reduction of the contaminant fraction of DNA obtained from an ancient giant panda bone

    Objective: A key challenge in ancient DNA research is massive microbial DNA contamination from the deposition site, which accumulates post mortem in the study organism's remains. Two simple and cost-effective methods to enrich the relative endogenous fraction of DNA in ancient samples involve treatment of sample powder with either bleach or Proteinase K pre-digestion prior to DNA extraction. Both approaches have yielded promising but varying results in other studies. Here, we contribute data on the performance of these methods using a comprehensive and systematic series of experiments applied to a single ancient bone fragment from a giant panda (Ailuropoda melanoleuca). Results: Bleach and pre-digestion treatments increased the endogenous DNA content up to ninefold. However, the absolute amount of DNA retrieved was dramatically reduced by all treatments. We also observed reduced DNA damage patterns in pre-treated libraries compared to untreated ones, resulting in longer mean fragment lengths and reduced thymine over-representation at fragment ends. Guanine–cytosine (GC) contents of both mapped and total reads are consistent between treatments and conform to general expectations, indicating no obvious biasing effect of the applied methods. Our results therefore confirm the value of bleach and pre-digestion as tools in palaeogenomic studies, provided sufficient material is available.
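
    Two of the summary statistics this abstract relies on are easy to state concretely: the endogenous fraction (reads mapping to the target genome over total reads) and per-read GC content. A small sketch with invented placeholder numbers:

```python
# Summary statistics used when comparing ancient-DNA library treatments
# (counts and sequences below are invented placeholders).
def endogenous_fraction(mapped_reads, total_reads):
    """Fraction of sequenced reads mapping to the target genome."""
    return mapped_reads / total_reads

def gc_content(seq):
    """Proportion of G and C bases in a read."""
    s = seq.upper()
    return (s.count("G") + s.count("C")) / len(s)

print(endogenous_fraction(9_000, 120_000))  # 0.075, i.e. 7.5% endogenous
print(gc_content("ATGGCCTA"))               # 0.5
```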

    A Multi-Cancer Mesenchymal Transition Gene Expression Signature Is Associated with Prolonged Time to Recurrence in Glioblastoma

    A stage-associated gene expression signature of coordinately expressed genes, including the transcription factor Slug (SNAI2) and other epithelial-mesenchymal transition (EMT) markers, has been found present in samples from publicly available gene expression datasets in multiple cancer types, including nonepithelial cancers. The expression levels of the co-expressed genes vary in a continuous and coordinate manner across the samples, ranging from absence of expression to strong co-expression of all genes. These data suggest that tumor cells may pass through an EMT-like process of mesenchymal transition to varying degrees. Here we show that, in glioblastoma multiforme (GBM), this signature is associated with time to recurrence following initial treatment. By analyzing data from The Cancer Genome Atlas (TCGA), we found that GBM patients who responded to therapy and had a long time to recurrence had low levels of the signature in their tumor samples (P = 3×10⁻⁷). We also found that the signature is strongly correlated in gliomas with the putative stem cell marker CD44, and is highly enriched among the differentially expressed genes in glioblastomas vs. lower grade gliomas. Our results suggest that long delay before tumor recurrence is associated with absence of the mesenchymal transition signature, raising the possibility that inhibiting this transition might improve the durability of therapy in glioma patients.
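
    A common way to operationalise a signature like this is to average the standardised expression of the signature genes per sample and test the score's association with the clinical variable. The sketch below uses that generic scoring approach with toy data; it is an assumption for illustration, not the study's TCGA analysis or survival model.

```python
# Generic gene-signature scoring: mean of per-gene z-scores over the
# signature set, then a rank correlation with time to recurrence (toy data).
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(7)
expr = rng.normal(size=(100, 50))          # samples x genes (toy)
sig_idx = np.arange(10)                    # columns of the signature genes
z = (expr - expr.mean(0)) / expr.std(0)    # standardise each gene
score = z[:, sig_idx].mean(1)              # per-sample signature score
time_to_recurrence = rng.exponential(300, size=100)
print(spearmanr(score, time_to_recurrence))
```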

    Pre-processing Agilent microarray data

    Background: Pre-processing methods for two-sample long oligonucleotide arrays, specifically the Agilent technology, have not been extensively studied. The goal of this study is to quantify some of the sources of error that affect measurement of expression using Agilent arrays and to compare Agilent's Feature Extraction software with pre-processing methods that have become the standard for normalization of cDNA arrays. These include log transformation followed by loess normalization with or without background subtraction, and often a between-array scale normalization procedure. The larger goal is to define best study design and pre-processing practices for Agilent arrays, and we offer some suggestions. Results: Simple loess normalization without background subtraction produced the lowest variability. However, without background subtraction, fold changes were biased towards zero, particularly at low intensities. ROC analysis of a spike-in experiment showed that differentially expressed genes are most reliably detected when background is not subtracted. Loess normalization and no background subtraction yielded an AUC of 99.7%, compared with 88.8% for Agilent processed fold changes. All methods performed well when error was taken into account by t- or z-statistics (AUCs ≥ 99.8%). A substantial proportion of genes showed dye effects, 43% (99% CI: 39%, 47%). However, these effects were generally small regardless of the pre-processing method. Conclusion: Simple loess normalization without background subtraction resulted in low-variance fold changes that more reliably ranked gene expression than the other methods. While t-statistics and other measures that take variation into account, including Agilent's z-statistic, can also be used to reliably select differentially expressed genes, fold changes are a standard measure of differential expression for exploratory work, cross-platform comparison and biological interpretation, and cannot be entirely replaced. Although dye effects are small for most genes, many array features are affected. Therefore, an experimental design that incorporates dye swaps or a common reference could be valuable.
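
    A minimal sketch of the recommended step, loess normalization of two-channel log-ratios with no background subtraction, using statsmodels' lowess as a stand-in for the loess fit. The simulated intensities and the smoothing fraction are assumptions for illustration.

```python
# MA-plot loess normalisation for two-channel arrays: remove the
# intensity-dependent dye bias from M = log2(R) - log2(G).
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

def loess_normalise(log_r, log_g, frac=0.3):
    """Subtract a lowess fit of M on A from the per-probe log-ratios."""
    M = log_r - log_g                  # log-ratio per probe
    A = 0.5 * (log_r + log_g)          # mean log-intensity per probe
    fit = lowess(M, A, frac=frac, return_sorted=False)
    return M - fit                     # normalised log-ratios

rng = np.random.default_rng(3)
g = rng.normal(8, 2, 5000)                     # toy Cy3 log-intensities
r = g + 0.1 * g + rng.normal(0, 0.3, 5000)     # Cy5 with intensity bias
print(loess_normalise(r, g).mean())            # ~0 after normalisation
```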

    Benzoxazinoids in Root Exudates of Maize Attract Pseudomonas putida to the Rhizosphere

    Benzoxazinoids, such as 2,4-dihydroxy-7-methoxy-2H-1,4-benzoxazin-3(4H)-one (DIMBOA), are secondary metabolites in grasses. In addition to their function in plant defence against pests and diseases above-ground, benzoxazinoids (BXs) have also been implicated in defence below-ground, where they can exert allelochemical or antimicrobial activities. We have studied the impact of BXs on the interaction between maize and Pseudomonas putida KT2440, a competitive coloniser of the maize rhizosphere with plant-beneficial traits. Chromatographic analyses revealed that DIMBOA is the main BX compound in root exudates of maize. In vitro analysis of DIMBOA stability indicated that KT2440 tolerance of DIMBOA is based on metabolism-dependent breakdown of this BX compound. Transcriptome analysis of DIMBOA-exposed P. putida identified increased transcription of genes controlling benzoate catabolism and chemotaxis. Chemotaxis assays confirmed motility of P. putida towards DIMBOA. Moreover, colonisation assays in soil with Green Fluorescent Protein (GFP)-expressing P. putida showed that DIMBOA-producing roots of wild-type maize attract significantly higher numbers of P. putida cells than roots of the DIMBOA-deficient bx1 mutant. Our results demonstrate a central role for DIMBOA as a below-ground semiochemical for recruitment of plant-beneficial rhizobacteria during the relatively young and vulnerable growth stages of maize.